In this assignment you will develop your initial concept note into a draft of a full project proposal. Treat this assignment as a “dry run” for developing a proposal for a grant or fellowship application, or for your Ph.D. prospectus.
Your proposal should include at least the following sections and information.
Front matter: Descriptive title, your name, date, reference to “SYS 7030 Time Series Analysis & Forecasting, Fall 2020”.
Abstract: A very brief summary of the project.
Give a narrative description of the problem you are addressing, and the methods you will use to address it. Provide context:
This work addresses the question: Why do people not use probabilistic forecasts for decision-making (National Research Council 2007)?
Describe the data set you will be analyzing: identify its source, and explain how it was generated and collected. Give a narrative description of the data-generating process: this piece is critical.
Since these will be time series data: identify the frequency of the data series (e.g., hourly, monthly), and the period of record.
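One way to check the frequency and period of record directly from a date column is sketched below. It uses only base R and a small, made-up vector of month-start dates (hypothetical, for illustration only); with real data you would substitute your own date column.

```r
# Hypothetical monthly date index, for illustration only
dates <- seq(as.Date("2001-01-01"), by = "month", length.out = 24)

# Period of record: first and last observation
period_of_record <- range(dates)

# Spacing between consecutive observations, in days;
# gaps of 28 to 31 days indicate a monthly series
spacing_days <- as.numeric(diff(sort(dates)))

print(period_of_record)
print(table(spacing_days))
```

Tabulating the spacing also exposes missing periods, which would show up as gaps roughly twice the usual length.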
esales <- dbGetQuery(db,'SELECT * from eia_elec_sales_va_all_m') # SQL code to retrieve data from a table in the remote database
# str(esales)
esales <- as_tibble(esales) # Convert dataframe to a 'tibble' for tidyverse work
# str(esales)
# Reference: https://arrow.apache.org/docs/r/
# if(!('arrow' %in% installed.packages())) install.packages('arrow')
library(arrow)
write_feather(esales, "esales.feather")
# Close connection -- this is good practice
dbDisconnect(db)
dbUnloadDriver(db_driver)
library(arrow)
Attaching package: 'arrow'
The following object is masked from 'package:utils':
timestamp
esales <- read_feather("esales.feather")
str(esales)
tibble [233 × 4] (S3: tbl_df/tbl/data.frame)
$ value: num [1:233] 8282 7839 8889 9368 9209 ...
$ date : Date[1:233], format: "2020-05-01" "2020-04-01" ...
$ year : int [1:233] 2020 2020 2020 2020 2020 2019 2019 2019 2019 2019 ...
$ month: int [1:233] 5 4 3 2 1 12 11 10 9 8 ...
print(esales)
# A tibble: 233 x 4
value date year month
<dbl> <date> <int> <int>
1 8282. 2020-05-01 2020 5
2 7839. 2020-04-01 2020 4
3 8889. 2020-03-01 2020 3
4 9368. 2020-02-01 2020 2
5 9209. 2020-01-01 2020 1
6 10038. 2019-12-01 2019 12
7 9291. 2019-11-01 2019 11
8 8757. 2019-10-01 2019 10
9 9874. 2019-09-01 2019 9
10 10912. 2019-08-01 2019 8
# … with 223 more rows
# References: https://www.tidyverse.org/, https://dplyr.tidyverse.org/
esales %>%
  filter(year == 2019) %>%
  filter(value > 9000) %>%
  print()
# A tibble: 10 x 4
value date year month
<dbl> <date> <int> <int>
1 10038. 2019-12-01 2019 12
2 9291. 2019-11-01 2019 11
3 9874. 2019-09-01 2019 9
4 10912. 2019-08-01 2019 8
5 11527. 2019-07-01 2019 7
6 9903. 2019-06-01 2019 6
7 9147. 2019-05-01 2019 5
8 9466. 2019-03-01 2019 3
9 9148. 2019-02-01 2019 2
10 10925. 2019-01-01 2019 1
esales %>%
  group_by(month) %>%
  summarise(mean = mean(value)) -> mean_esales_by_month
`summarise()` ungrouping output (override with `.groups` argument)
esales %>%
  mutate(sales_TWh = value/1000) %>%
  select(-value)
# filter(data object, condition) : syntax for filter() command
#Reference: https://ggplot2.tidyverse.org/
ggplot(data = esales, aes(x = date, y = value)) +
  geom_line() +
  xlab("Year") +
  ylab("Virginia monthly total electricity sales (GWh)")
# install.packages("tsibble")
library(tsibble) # Reference: https://tsibble.tidyverts.org/articles/intro-tsibble.html
Attaching package: 'tsibble'
The following object is masked from 'package:lubridate':
interval
esales %>% as_tsibble(index = date) -> esales_tbl_ts
print(esales_tbl_ts)
# A tsibble: 233 x 4 [1D]
value date year month
<dbl> <date> <int> <int>
1 9576. 2001-01-01 2001 1
2 7820. 2001-02-01 2001 2
3 8070. 2001-03-01 2001 3
4 7153. 2001-04-01 2001 4
5 7224. 2001-05-01 2001 5
6 8264. 2001-06-01 2001 6
7 8896. 2001-07-01 2001 7
8 9404. 2001-08-01 2001 8
9 7753. 2001-09-01 2001 9
10 7272. 2001-10-01 2001 10
# … with 223 more rows
library(lubridate) # Make it easy to deal with dates
esales_tbl_ts %>% filter(month==3)
esales_tbl_ts %>% filter(month(date)==3)
esales_tbl_ts %>%
select(date, sales_GWh = value) -> elsales_tbl_ts
print(elsales_tbl_ts)
# A tsibble: 233 x 2 [1D]
date sales_GWh
<date> <dbl>
1 2001-01-01 9576.
2 2001-02-01 7820.
3 2001-03-01 8070.
4 2001-04-01 7153.
5 2001-05-01 7224.
6 2001-06-01 8264.
7 2001-07-01 8896.
8 2001-08-01 9404.
9 2001-09-01 7753.
10 2001-10-01 7272.
# … with 223 more rows
hist(elsales_tbl_ts$sales_GWh, breaks=40)
# install.packages("feasts"), Reference: https://feasts.tidyverts.org/
library(feasts)
Loading required package: fabletools
elsales_tbl_ts %>%
mutate(Month = yearmonth(date)) %>%
as_tsibble(index = Month) -> vaelsales_tbl_ts
vaelsales_tbl_ts %>% gg_season(sales_GWh, labels = "both") + ylab("Virginia electricity sales (GWh)")
# install.packages('tsibbledata')
library(tsibbledata)
aus_production
aus_production %>% gg_season(Electricity)
aus_production %>% gg_season(Beer)
vaelsales_tbl_ts %>%
gg_subseries(sales_GWh)
# aus_production %>% gg_subseries(Beer)
vaelsales_tbl_ts %>% filter(month(Month) %in% c(3,6,9,12)) %>% gg_lag(sales_GWh, lags = 1:2)
vaelsales_tbl_ts %>% filter(month(Month) == 1) %>% gg_lag(sales_GWh, lags = 1:2)
vaelsales_tbl_ts %>% ACF(sales_GWh) %>% autoplot()
# if(!('fpp3' %in% installed.packages())) install.packages('fpp3')
library(fpp3)
── Attaching packages ────────────────────────────────────────────── fpp3 0.3 ──
✓ fable 0.2.1
── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
x lubridate::date() masks base::date()
x dplyr::filter() masks stats::filter()
x tsibble::interval() masks lubridate::interval()
x dplyr::lag() masks stats::lag()
# decompose(vaelsales_tbl_ts)
vaelsales_tbl_ts %>%
  model(STL(sales_GWh ~ trend(window = 21) + season(window = 'periodic'), robust = TRUE)) %>%
  components() %>%
  autoplot()
vaelsales_tbl_ts %>%
  mutate(ln_sales_GWh = log(sales_GWh)) %>%
  model(STL(ln_sales_GWh ~ trend(window = 21) + season(window = 'periodic'),
            robust = TRUE)) %>%
  components() %>%
  autoplot()
vaelsales_tbl_ts %>%
features(sales_GWh, feat_stl)
vaelsales_tbl_ts %>%
features(sales_GWh, feature_set(pkgs="feasts"))
Warning: `n_flat_spots()` is deprecated as of feasts 0.1.5.
Please use `longest_flat_spot()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
library(tsibbledata) # Data sets package
print(global_economy)
# A tsibble: 15,150 x 9 [1Y]
# Key: Country [263]
Country Code Year GDP Growth CPI Imports Exports Population
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AFG 1960 537777811. NA NA 7.02 4.13 8996351
2 Afghanistan AFG 1961 548888896. NA NA 8.10 4.45 9166764
3 Afghanistan AFG 1962 546666678. NA NA 9.35 4.88 9345868
4 Afghanistan AFG 1963 751111191. NA NA 16.9 9.17 9533954
5 Afghanistan AFG 1964 800000044. NA NA 18.1 8.89 9731361
6 Afghanistan AFG 1965 1006666638. NA NA 21.4 11.3 9938414
7 Afghanistan AFG 1966 1399999967. NA NA 18.6 8.57 10152331
8 Afghanistan AFG 1967 1673333418. NA NA 14.2 6.77 10372630
9 Afghanistan AFG 1968 1373333367. NA NA 15.2 8.90 10604346
10 Afghanistan AFG 1969 1408888922. NA NA 15.0 10.1 10854428
# … with 15,140 more rows
global_economy %>% filter(Country=="Sweden") %>% print()
# A tsibble: 58 x 9 [1Y]
# Key: Country [1]
Country Code Year GDP Growth CPI Imports Exports Population
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Sweden SWE 1960 14842870293. NA 9.21 23.4 23.0 7484656
2 Sweden SWE 1961 16147160123. 5.68 9.41 21.7 22.3 7519998
3 Sweden SWE 1962 17511477311. 4.26 9.86 21.4 21.9 7561588
4 Sweden SWE 1963 18954132366. 5.33 10.1 21.5 21.9 7604328
5 Sweden SWE 1964 21137242561. 6.82 10.5 21.9 22.3 7661354
6 Sweden SWE 1965 23260320646. 3.82 11.0 22.5 21.9 7733853
7 Sweden SWE 1966 25302033132. 2.09 11.7 21.9 21.4 7807797
8 Sweden SWE 1967 27463409202. 3.37 12.2 21.0 21.1 7867931
9 Sweden SWE 1968 29143383491. 3.64 12.5 21.6 21.6 7912273
10 Sweden SWE 1969 31649203886. 5.01 12.8 23.0 22.8 7968072
# … with 48 more rows
global_economy %>%
  filter(Country == "Sweden") %>%
  autoplot(GDP) +
  ggtitle("GDP for Sweden") + ylab("GDP ($US)") # GDP is recorded in $US, not billions
global_economy %>% model(trend_model = TSLM(GDP ~ trend())) -> fit
Warning: 7 errors (1 unique) encountered for trend_model
[7] 0 (non-NA) cases
fit
fit %>% filter(Country == "Sweden") %>% residuals()
fit %>% filter(Country == "Sweden") %>% residuals() %>% autoplot(.resid)
global_economy %>%
  filter(Country == "Sweden") %>%
  autoplot(log(GDP)) +
  ggtitle("ln(GDP) for Sweden") + ylab("ln(GDP in $US)") # the plotted variable is the natural log of GDP
global_economy %>%
model(trend_model = TSLM(log(GDP) ~ trend())) -> logfit
Warning: 7 errors (1 unique) encountered for trend_model
[7] 0 (non-NA) cases
logfit %>% filter(Country == "Sweden") %>% residuals() %>% autoplot()
Plot variable not specified, automatically selected `.vars = .resid`
global_economy %>% model(trend_model = TSLM(log(GDP) ~ log(Population))) -> fit3
Warning: 7 errors (1 unique) encountered for trend_model
[7] 0 (non-NA) cases
fit3 %>% filter(Country == "Sweden") %>% residuals() %>% autoplot()
Plot variable not specified, automatically selected `.vars = .resid`
fit %>% forecast(h = "3 years") -> fcast3yrs
fcast3yrs
fcast3yrs %>% filter(Country == "Sweden", Year == 2020) %>% str()
fable [1 × 5] (S3: fbl_ts/tbl_ts/tbl_df/tbl/data.frame)
$ Country: Factor w/ 263 levels "Afghanistan",..: 232
$ .model : chr "trend_model"
$ Year : num 2020
$ GDP : dist [1:1]
..$ 3:List of 2
.. ..$ mu : num 5.45e+11
.. ..$ sigma: num 5.34e+10
.. ..- attr(*, "class")= chr [1:2] "dist_normal" "dist_default"
..@ vars: chr "GDP"
$ .mean : num 5.45e+11
- attr(*, "key")= tibble [1 × 3] (S3: tbl_df/tbl/data.frame)
..$ Country: Factor w/ 263 levels "Afghanistan",..: 232
..$ .model : chr "trend_model"
..$ .rows : list<int> [1:1]
.. ..$ : int 1
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
- attr(*, "index")= chr "Year"
..- attr(*, "ordered")= logi TRUE
- attr(*, "index2")= chr "Year"
- attr(*, "interval")= interval [1:1] 1Y
..@ .regular: logi TRUE
- attr(*, "response")= chr "GDP"
- attr(*, "dist")= chr "GDP"
- attr(*, "model_cn")= chr ".model"
fcast3yrs %>%
  filter(Country == "Sweden") %>%
  autoplot(global_economy) +
  ggtitle("GDP for Sweden") + ylab("GDP ($US)") # forecasts and history are both in $US
Model residuals:
Your data: \(y_1, y_2, \ldots, y_T\)
Fitted values: \(\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T\)
Model residuals: \(e_t = y_t - \hat{y}_t\)
Forecast errors:
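No definition follows this heading in the text. As an assumption about what was intended, the standard definition is given here: a forecast error is the difference between an observed value outside the fitting sample and its forecast, where \(\hat{y}_{T+h|T}\) denotes the \(h\)-step-ahead forecast made using data up to time \(T\).

\[ e_{T+h} = y_{T+h} - \hat{y}_{T+h|T} \]

Unlike residuals, which are computed on the data used to fit the model, forecast errors are computed on held-out observations.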
augment(fit)
augment(fit) %>%
  filter(Country == "Sweden") %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 20) +
  ggtitle("Histogram of residuals")
Write down an equation (or set of equations) that represent the data-generating process formally.
For the electricity sales data, maybe the process looks like:
\[ y_t = \text{Trend}_t \times \text{Seasonal}_t \times \text{Residual}_t \]
\[ y_t = \beta_0 + \beta_1 t + \beta_2 m + \varepsilon_t \]
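# ETS forecasts
# ets(), its forecast() method, and the taylor data set come from the 'forecast' package
USAccDeaths %>%
  forecast::ets() %>%
  forecast::forecast() %>%
  autoplot()
str(forecast::taylor)
plot(forecast::taylor)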
If applicable: describe any transformations of the data (e.g., differencing, taking logs) you need to make to get the data into a form (e.g., linear) ready for numerical analysis.
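As an illustration, take the multiplicative form suggested earlier for the electricity sales data as a working assumption. Taking logs turns the product into a sum, which is the linear form that many estimation methods expect:

\[ \log y_t = \log \text{Trend}_t + \log \text{Seasonal}_t + \log \text{Residual}_t \]

First differencing, another common transformation, replaces the series with its period-to-period changes:

\[ \Delta y_t = y_t - y_{t-1} \]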
What kind of process is it? \(AR(p)\)? White noise with drift? Something else?
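For reference, an \(AR(p)\) process has the form \(y_t = c + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t\). The sketch below, which uses only base R, simulates an \(AR(1)\) series and then fits an \(AR(1)\) model to it. The coefficient (0.7), sample size, and seed are arbitrary illustrative choices, not values taken from the course data.

```r
set.seed(7030)  # arbitrary seed so the simulation is reproducible

# Simulate 500 observations from an AR(1) process with coefficient 0.7
# (the coefficient and sample size are illustrative choices)
y <- arima.sim(model = list(ar = 0.7), n = 500)

# Fit an AR(1) model; order = c(p, d, q) with p = 1, no differencing, no MA terms
fit_ar1 <- arima(y, order = c(1, 0, 0))
print(coef(fit_ar1))

# For an AR(p) process, sample partial autocorrelations should be
# close to zero beyond lag p
pacf_values <- pacf(y, plot = FALSE)
print(round(pacf_values$acf[1:5], 2))
```

Inspecting the partial autocorrelations of your own series in the same way is one common, informal way to judge a plausible value of \(p\).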
Write down an equation expressing each realization of the stochastic process \(y_t\) as a function of other observed data (which could include lagged values of \(y\)), unobserved parameters (\(\beta\)), and an error term (\(\varepsilon_t\)). Ex:
\[y = X\cdot\beta + \varepsilon\]
Add a model of the error process. Ex: \(\varepsilon \sim N(0, \sigma^2 I_T)\).
Describe how the formal statistical model captures and aligns with the narrative of the data-generating process. Flag any statistical challenges raised by the data generating process, e.g. selection bias; survivorship bias; omitted variables bias, etc.
Describe what information you wish to extract from the data. Do you wish to… estimate the values of the unobserved model parameters? create a tool for forecasting? estimate the exceedance probabilities for future realizations of \(y_t\)?
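To make the last of these concrete: when a forecast is expressed as a Normal distribution, an exceedance probability is an upper-tail probability. The sketch below uses the mean and standard deviation displayed in the earlier `str()` output for the Sweden 2020 GDP forecast (rounded as shown there); the threshold is a hypothetical value chosen only for illustration.

```r
# Forecast distribution parameters as displayed in the str() output for Sweden, 2020
mu    <- 5.45e11   # forecast mean ($US)
sigma <- 5.34e10   # forecast standard deviation ($US)

# Hypothetical threshold of interest (illustrative only)
threshold <- 6e11

# P(y > threshold) under a Normal(mu, sigma^2) forecast distribution
p_exceed <- pnorm(threshold, mean = mu, sd = sigma, lower.tail = FALSE)
print(p_exceed)
```

The same calculation applies to any threshold relevant to a decision, provided the Normal forecast distribution is a reasonable assumption for the series.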
Describe your plan for getting this information. OLS regression? Some other statistical technique?
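If OLS is part of the plan, the sketch below shows what the estimate looks like for the model \(y = X\cdot\beta + \varepsilon\), using base R only. The data are simulated (a hypothetical linear trend plus Normal noise); the second calculation reproduces the `lm()` estimates from the normal equations, \(\hat{\beta} = (X'X)^{-1}X'y\).

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical data: a linear trend plus Normal noise (all values illustrative)
n <- 120
t_index <- 1:n
y <- 50 + 0.3 * t_index + rnorm(n, mean = 0, sd = 2)

# OLS via lm(): estimates beta_0 (intercept) and beta_1 (slope on time)
fit_lm <- lm(y ~ t_index)

# The same estimates from the normal equations: beta_hat = (X'X)^{-1} X'y
X <- cbind(1, t_index)
beta_hat <- solve(crossprod(X), crossprod(X, y))

print(coef(fit_lm))
print(drop(beta_hat))
```

In practice `lm()` (or `TSLM()` for tsibble data, as used earlier) is the convenient route; the matrix form is shown only to connect the code to the equation.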
If you can: describe briefly which computational tools you will use (e.g., R), and which packages you expect to draw on.
Prepare your proposal using Markdown. (You may find it useful to generate your Markdown file from some other tool, e.g. R Markdown in RStudio.) Submit your proposal by pushing it to your repo within the course organization on GitHub. When your proposal is ready, notify the instructor by also creating a submission for this assignment on Collab. Please also upload a PDF version of your proposal to Collab as part of your submission.
Depending on your prior experience, you may find this assignment challenging. Treat it as an opportunity to make progress on your own research program. Make your proposal as complete as you can, but note that this is only a first draft. You will have more opportunities to refine your work over the next two months, in consultation with the instructor, your advisor, and your classmates.
National Research Council. 2007. Completing the Forecast: Characterizing and Communicating Uncertainty for Better Decisions Using Weather and Climate Forecasts. https://doi.org/10.17226/11699.